{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HAP Transform Example Notebook\n",
    "=====================================\n",
    "\n",
    "This notebook processes a CSV file containing text data to analyze for Hate, Abuse, and Profanity (HAP) scores.\n",
    "It converts the CSV file into Parquet format, uses the `hap_local_python.py` script to calculate HAP scores, \n",
    "and generates outputs for further analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### Overview\n",
    "This notebook demonstrates the use of the HAP transformation to annotate documents with a `hap_score`, \n",
    "indicating the likelihood of Hate, Abuse, or Profanity in the text.\n",
    "\n",
    "### Workflow\n",
    "The HAP process consists of:\n",
    "1. **Sentence Splitting**: Documents are split into sentences using NLTK.\n",
    "2. **HAP Annotation**: Each sentence is scored between 0 and 1 (1 = high HAP, 0 = no HAP).\n",
    "3. **Aggregation**: The document's final HAP score is the maximum score among all sentences.\n",
    "\n",
    "\n",
    "### Configuration\n",
    "- **Model Name**: IBM Granite Guardian (`ibm-granite/granite-guardian-hap-38m` by default).\n",
    "- **Document Text Column** (`--doc_text_column`): Specify the input column containing document text to generate the hap_score against. Defaults to `contents`.\n",
    "- **Annotation Column** (`--annotation_column`): Specify the output column for HAP scores. Defaults to `hap_score`.\n",
    "\n",
    "\n",
    "### Steps in This Notebook\n",
    "1. Define paths and import libraries.\n",
    "2. Convert CSV input to Parquet.\n",
    "3. Run the HAP transformation script.\n",
    "4. View and analyze the results.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install  'data-prep-toolkit[ray]==0.2.2'\n",
    "! pip install  'data-prep-toolkit-transforms[all]==0.2.2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import pandas as pd\n",
    "import subprocess\n",
    "\n",
    "from data_processing.runtime.pure_python import PythonTransformLauncher\n",
    "from data_processing.utils import ParamsUtils\n",
    "from hap_transform_python import HAPPythonTransformConfiguration\n",
    "from hap_transform import HAPTransform"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Setup runtime parameters\n",
    "\n",
    "- input-folder: Path to the input data to be used by the transform.\n",
    "- output-folder: Path where the output file with HAP scores will be saved.\n",
    "- doc_text_column: The column containing the text for analysis (For ex.: `Customer Feedback`).\n",
    "- annotation_column: The column where HAP scores will be saved (default: `hap_score`).\n",
    "\n",
    "**Customization**: \n",
    "- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
    "- If your text column has a different name, update the value of `--doc_text_column` accordingly.\n",
    "- You can adjust other parameters like `--batch_size` and `--max_length` if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# create parameters\n",
    "input_csv = './input-csv'\n",
    "input_folder = './input-parquet'\n",
    "output_folder = './output'\n",
    "output_parquet_file = os.path.join(output_folder, \"customer-feedback.parquet\")\n",
    "output_csv_file = os.path.join(output_folder, \"customer-feedback.csv\")\n",
    "\n",
    "local_conf = {\n",
    "    \"input_folder\": input_folder,\n",
    "    \"output_folder\": output_folder,\n",
    "}\n",
    "\n",
    "print(input_folder)\n",
    "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
    "params = {\n",
    "    # Data access. Only required parameters are specified\n",
    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
    "    # execution info\n",
    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
    "    \"runtime_job_id\": \"job_id\",\n",
    "    \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
    "    # hap params\n",
    "    \"doc_text_column\": \"Customer Feedback\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Setup Input Data\n",
    "\n",
    "- Place your CSV file in the `input_csv`.\n",
    "- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
    "- If your text column has a different name, update the `doc_text_column` parameter in later cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clear the existing input-parquet folder\n",
    "if os.path.exists(input_folder):\n",
    "    for file_name in os.listdir(input_folder):\n",
    "        file_path = os.path.join(input_folder, file_name)\n",
    "        try:\n",
    "            os.remove(file_path)\n",
    "            print(f\"Deleted file: {file_path}\")\n",
    "        except Exception as e:\n",
    "            print(f\"Failed to delete {file_path}: {e}\")\n",
    "else:\n",
    "    os.makedirs(input_csv, exist_ok=True)  # Updated line\n",
    "    os.makedirs(input_folder, exist_ok=True)  # Updated line\n",
    "    print(f\"Created folder: {input_csv}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv_files = [f for f in os.listdir(input_csv) if f.endswith(\".csv\")]\n",
    "\n",
    "if not csv_files:\n",
    "    print(f\"No CSV files found in the input folder: {input_csv}\")\n",
    "    print(\"Please place a CSV file in the input folder and rerun this script.\")\n",
    "else:\n",
    "    # Pick the first CSV file in the folder\n",
    "    csv_file_path = os.path.join(input_csv, csv_files[0])\n",
    "    print(f\"Using CSV file: {csv_file_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Convert CSV to Parquet\n",
    "Convert the selected CSV file to Parquet format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "parquet_file_path = os.path.join(input_folder, \"customer-feedback.parquet\")\n",
    "df = pd.read_csv(csv_file_path)\n",
    "df.to_parquet(parquet_file_path, index=False)\n",
    "print(f\"CSV file converted to Parquet format at: {parquet_file_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4: Invoke the transform\n",
    "Use python runtime to invoke the transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
    "launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n",
    "launcher.launch()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#The specified folder will include the transformed parquet files.\n",
    "\n",
    "import glob\n",
    "glob.glob(\"python/output/*\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5: Display Output in a Readable Format\n",
    "\n",
    "This step checks for any existing CSV files in the output folder and removes them before generating new ones. The following actions are performed:\n",
    "\n",
    "1. **Listing Output Files**: The script lists all files in the output folder.\n",
    "2. **Check for Parquet Files**: It identifies `.parquet` files in the output folder.\n",
    "3. **Remove Old CSV Files**: If any previous output files (`hap_complete_output.csv` or `hap_filtered_output.csv`) exist, they are deleted.\n",
    "4. **Read Parquet File**: The Parquet file is read into a DataFrame.\n",
    "5. **Filter Data**: The relevant columns, `doc_text_column` (from the environment variable) and `hap_score_column`, are selected from the DataFrame.\n",
    "6. **CSV Output**: Convert the parquet output to CSV \n",
    "7. **Display Output**: Display Output in the notebook for a quick reference \n",
    "\n",
    "Example Output:\n",
    "<div>\n",
    "<style scoped>\n",
    "    .dataframe tbody tr th:only-of-type {\n",
    "        vertical-align: middle;\n",
    "    }\n",
    "\n",
    "    .dataframe tbody tr th {\n",
    "        vertical-align: top;\n",
    "    }\n",
    "\n",
    "    .dataframe thead th {\n",
    "        text-align: left;\n",
    "    }\n",
    "</style>\n",
    "<table border=\"0\" class=\"dataframe\">\n",
    "  <thead>\n",
    "    <tr style=\"text-align: left;\">\n",
    "      <th></th>\n",
    "      <th>Customer Feedback</th>\n",
    "      <th>hap_score</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr>\n",
    "      <th>0</th>\n",
    "      <td>Rating: 4 Comments: \"Service was prompt, but ...</td>\n",
    "      <td>0.000195</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>1</th>\n",
    "      <td>Rating: 5 Comments: \"Great help from Peter! H...</td>\n",
    "      <td>0.000153</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>2</th>\n",
    "      <td>Rating: 3 Comments: \"The service was quick, b...</td>\n",
    "      <td>0.000169</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th>3</th>\n",
    "      <td>Rating: 5 Comments: \"Excellent service and ad...</td>\n",
    "      <td>0.000158</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Locate transformed Parquet files in the output folder\n",
    "output_parquet_files = [f for f in os.listdir(output_folder) if f.endswith(\".parquet\")]\n",
    "\n",
    "if output_parquet_files:\n",
    "    # Clear existing CSV files in the output folder\n",
    "    for file_name in os.listdir(output_folder):\n",
    "        if file_name.endswith(\".csv\"):\n",
    "            file_path = os.path.join(output_folder, file_name)\n",
    "            try:\n",
    "                os.remove(file_path)\n",
    "                print(f\"Deleted old CSV file: {file_path}\")\n",
    "            except Exception as e:\n",
    "                print(f\"Failed to delete {file_path}: {e}\")\n",
    "\n",
    "    for output_parquet in output_parquet_files:\n",
    "        original_parquet_path = os.path.join(output_folder, output_parquet)\n",
    "        try:\n",
    "            # Rename the Parquet file\n",
    "            os.rename(original_parquet_path, output_parquet_file)\n",
    "            print(f\"Renamed Parquet file to: {output_parquet_file}\")\n",
    "\n",
    "            # Convert the renamed Parquet file to CSV\n",
    "            transformed_df = pd.read_parquet(output_parquet_file)\n",
    "            transformed_df.to_csv(output_csv_file, index=False)\n",
    "            print(f\"Transformed CSV file saved at: {output_csv_file}\")\n",
    "\n",
    "            # Display selected columns in tabular format\n",
    "            if 'Customer Feedback' in transformed_df.columns and 'hap_score' in transformed_df.columns:\n",
    "                display_df = transformed_df[['Customer Feedback', 'hap_score']]\n",
    "                print(\"\\nSelected Columns (Customer Feedback and HAP Score):\")\n",
    "                from IPython.display import display  # Ensure pretty display in Jupyter\n",
    "                display(display_df.head(10))  # Display the first 10 rows\n",
    "            else:\n",
    "                print(\"The required columns ('Customer Feedback' and 'hap_score') are not in the transformed data.\")\n",
    "\n",
    "        except Exception as e:\n",
    "            print(f\"Error processing files: {e}\")\n",
    "else:\n",
    "    print(f\"No Parquet files found in the output folder: {output_folder}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dataprepkit",
   "language": "python",
   "name": "data-prep-kit"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
